[dev-v2.9] Add ServiceMonitor for scraping fleet-controller in `rancher-monitoring` chart #3949

p-se · 2024-05-21T08:13:21Z

Issue:

The direct issue is this one: rancher/fleet#2295
The whole story is here: rancher/fleet#1408

The PR that introduced metrics into fleet: rancher/fleet#2172

The changes have been merged into Fleet v0.10.0-rc.13. Fleet 0.10 is planned to be released with Rancher 2.9.

Problem

Fleet recently introduced metrics, and further monitoring additions are planned that build on them. For these to work, Prometheus needs to scrape the fleet-controllers. This requires an additional ServiceMonitor that points to the Kubernetes services created by the fleet chart, which in turn point to the fleet-controller metrics.

Solution

An additional ServiceMonitor needs to be created when the rancher-monitoring chart is installed, so that the Prometheus instance installed by the chart is automatically configured to scrape the fleet-controllers.
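For illustration, a ServiceMonitor of this kind could look roughly like the sketch below. It is not the manifest added by this PR: the resource name, the namespace, the `app: fleet-controller` label and the `metrics` port name are assumptions and need to match whatever the fleet chart actually creates.

```yaml
# Minimal sketch of a ServiceMonitor for the fleet-controller metrics service.
# All names, labels and the namespace below are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fleet-controller            # hypothetical name
  namespace: cattle-fleet-system    # assumed namespace of the fleet chart
spec:
  # Only look for matching services in the (assumed) fleet namespace.
  namespaceSelector:
    matchNames:
      - cattle-fleet-system
  # Select the metrics service by label; the label is an assumption.
  selector:
    matchLabels:
      app: fleet-controller
  endpoints:
    - port: metrics                 # assumed name of the service port
      path: /metrics
```

Once the prometheus-operator picks up such a resource, the corresponding target should show up in the Prometheus UI under Targets.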

This enables further monitoring capabilities to be added to Rancher through the rancher-monitoring chart, for instance Prometheus alerts or Grafana dashboards. The latter may be embedded into Rancher, in the same way that the existing Grafana dashboards are embedded and displayed through the Rancher UI when the rancher-monitoring chart is installed.

Testing

On a cluster with Rancher and a fleet version >= v0.10.0-rc.13, install the rancher-monitoring chart that includes the changes of this PR.
Open the Prometheus UI, navigate to Targets and check for fleet-controller.
If metrics are to be tested with sharding enabled in Fleet, a feature that was also first introduced in v0.10.0-rc.13, make sure you use a fleet version that includes fleet#2420 ("metrics: make sure metrics work well with sharding"), which, at the time of writing, is not yet in an RC of fleet. Fleet also needs to be deployed with sharding enabled, as described in the fleet-docs.

Engineering Testing

Manual Testing

Performed as described in Testing, including testing with sharding enabled in fleet.

Automated Testing

The initial PR adds E2E tests that check the metrics exposed by the fleet-controller through the services generated by the helm chart (when fleet is installed). Those tests do not cover the usage of a ServiceMonitor as introduced in this PR. Further PRs have followed to extend and improve testing of metrics in fleet.

QA Testing Considerations

Regressions Considerations

The probability of this change introducing regressions is low, as it simply extends already implemented functionality by a rather simple resource, which is part of the rancher-monitoring-crd chart.

For some more context, the ServiceMonitor is a custom Kubernetes resource defined by the prometheus-operator. The operator watches the resource and configures Prometheus to scrape an additional target, which in this case is fleet. If anything inside this resource is wrong, it is not expected to affect any other resources of the same kind. It would be surprising if scraping this amount of additional metrics had a significant performance impact, although in the long term it could increase the storage space required for metrics. That said, Prometheus is by default configured to retain data for only 15 days (with a default retention size of 50G in the rancher-monitoring chart), so this aspect should also be negligible. The scraped metrics could in principle conflict with other metrics, which is why they are prefixed with fleet_, making conflicts virtually impossible.
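If storage growth does become a concern, retention can be tuned through the chart values. The fragment below is a sketch that assumes the rancher-monitoring chart exposes the same `prometheus.prometheusSpec` keys as the upstream kube-prometheus-stack chart. The numbers only mirror the figures mentioned above, so check the chart's own values.yaml for the actual defaults.

```yaml
# Sketch of a values override for Prometheus retention.
# The values mirror the figures quoted above and are illustrative only.
prometheus:
  prometheusSpec:
    retention: 15d          # how long samples are kept
    retentionSize: 50GiB    # upper bound on the size of stored data
```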

Backporting considerations

This change does not need to be backported to other versions. This is a new feature in fleet and no plans exist to backport it.

github-actions · 2024-05-21T08:13:32Z

Validation steps

Ensure all container images have repository and tag on the same level, so that they are all included in rancher-images.txt, which is used by airgap customers.

  Ex:-
    longhorn-controller:
      repository: rancher/hardened-sriov-cni
      tag: v2.6.3-build20230913

Add a 👍 (thumbs up) reaction to this comment once done. CI won't pass without this reaction to the github-action bot's latest validation comment.
Approve the PR to run the CI check.

thehejik · 2024-06-06T12:02:23Z

Test report

Apart from the fact that rancher-monitoring 104.0.0-rc1+up45.31.1 is broken and cannot be installed directly into Rancher v2.9-b456233ab32b27b221d14244df7b0223eacfe078-head, the PR works as expected. I could see the fleet-controller target under Prometheus.

As a workaround, I upgraded from the previous version 103.1.0+up45.31.1, which worked.

Environment

single node k3s v1.28.6+k3s2 local cluster with rancher v2.9-b456233ab32b27b221d14244df7b0223eacfe078-head
Fleet version: fleet:104.0.0+up0.10.0-rc.14

Test

Enable "Include Prerelease versions" in Rancher Preferences.
Add the App repository from this PR. git repo: https://github.com/p-se/rancher-charts.git branch: add-fleet-smon
First install monitoring 103.1.0+up45.31.1, then upgrade to 104.0.0-rc1+up45.31.1 from the repository defined above.
On the Local cluster, go to Monitoring -> Prometheus Targets. After a while, a new metrics endpoint for fleet appears there.

thehejik · 2024-06-06T14:56:13Z

The installation problem of 104 has been fixed by #4026

Now I could install the version directly and the fleet target is there.

p-se · 2024-06-21T15:28:39Z

Relates to rancher/fleet#2460

p-se self-assigned this May 21, 2024

p-se requested a review from a team as a code owner May 21, 2024 08:13

p-se force-pushed the add-fleet-smon branch from 53a6d50 to e054e26 Compare May 22, 2024 09:34

p-se force-pushed the add-fleet-smon branch from e054e26 to 44c5e1a Compare May 22, 2024 12:28

manno mentioned this pull request Jun 5, 2024

Add metrics config to rancher-monitoring chart rancher/fleet#2295

Closed

prachidamle requested review from joshmeranda and alexandreLamarre June 5, 2024 21:51

alexandreLamarre approved these changes Jun 5, 2024

p-se added 2 commits June 6, 2024 09:35

Add ServiceMonitor for Fleet metrics in rancher-monitoring chart

d611d7b

Refers to rancher/fleet#2295

make charts

1ecc7b8

p-se force-pushed the add-fleet-smon branch from 44c5e1a to 1ecc7b8 Compare June 6, 2024 07:55

p-se changed the title ~~Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart~~ [dev-2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart Jun 6, 2024

p-se changed the title ~~[dev-2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart~~ [dev-v2.9] Add ServiceMonitor for scraping fleet-controller in rancher-monitoring chart Jun 6, 2024

Merge branch 'dev-v2.9' into add-fleet-smon

adb1d89

joshmeranda approved these changes Jun 6, 2024

thehejik merged commit c74ab29 into rancher:dev-v2.9 Jun 6, 2024
6 checks passed

skanakal pushed a commit to skanakal/charts that referenced this pull request Jun 7, 2024

[dev-v2.9] Add ServiceMonitor for scraping fleet-controller in `ranch…

6aba3d2

…er-monitoring` chart (rancher#3949)

krunalhinguu pushed a commit to krunalhinguu/charts that referenced this pull request Jul 15, 2024

[dev-v2.9] Add ServiceMonitor for scraping fleet-controller in `ranch…

ec37c6e

…er-monitoring` chart (rancher#3949)
